The Proper Care and Feeding of CAMELS: How Limited Training Data Affects Streamflow Prediction

This paper investigates how the number of training basins and the length of the training period affect the performance of EA-LSTM and XGBoost models.

Abstract

Accurate streamflow prediction largely relies on historical meteorological records and streamflow measurements. For many regions, however, such data are only scarcely available. To train an appropriate model for a region with given amounts of data, it is therefore indispensable to know a model’s sensitivity to limited training data, both in terms of geographic diversity and different time spans. In this study, we provide decision support for tree- and LSTM-based models. We feed the models meteorological observations disseminated with the CAMELS dataset, and individually restrict the training period length, number of training basins, and input sequence length. Our findings show that tree- and LSTM-based models provide similarly accurate predictions on small datasets, while LSTMs are superior given more training data. We further quantify how additional training data improve predictions and estimate how many previous days of forcings we should feed the models to obtain best predictions for each training set size.

Paper

Gauch, M., Mai, J., and Lin, J. “The proper care and feeding of CAMELS: How limited training data affects streamflow prediction.” arXiv preprint arXiv:1911.07249 (2019).

This paper is now published in Environmental Modelling & Software, Volume 135, 2021.

Code

Code to reproduce the results of this paper can be found in this GitHub repository.
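
As a rough illustration of the three restrictions named in the abstract (number of training basins, training period length, and input sequence length), the following is a minimal sketch and not the authors' code. The function name, the dictionary-based data layout, the synthetic values, and the choice to take the first n_days of each record as the training period are all assumptions made for this example; the repository holds the actual experiment setup.

```python
import random


def restrict_training_data(forcings, streamflow, n_basins, n_days, seq_len, seed=0):
    """Build training samples from a restricted subset of the data.

    Hypothetical helper for illustration only.
    forcings:   dict mapping basin id -> list of per-day feature lists
    streamflow: dict mapping basin id -> list of daily streamflow values
    """
    basin_ids = sorted(forcings)
    if n_basins > len(basin_ids):
        raise ValueError("n_basins exceeds the number of available basins")

    # Restriction 1: draw a reproducible random subset of basins.
    rng = random.Random(seed)
    chosen = sorted(rng.sample(basin_ids, n_basins))

    samples = []
    for basin in chosen:
        # Restriction 2: keep only the first n_days of each record
        # (an assumption; other ways of choosing the period are possible).
        x = forcings[basin][:n_days]
        y = streamflow[basin][:n_days]
        if seq_len > len(x):
            raise ValueError("seq_len exceeds the restricted training period")

        # Restriction 3: each sample sees seq_len consecutive days of forcings
        # and predicts the streamflow on the last day of that window.
        for end in range(seq_len, len(x) + 1):
            samples.append((basin, x[end - seq_len:end], y[end - 1]))

    return chosen, samples


# Synthetic stand-in data: 5 basins, 10 days, 2 forcing features per day.
forcings = {b: [[b * 100 + d, d] for d in range(10)] for b in range(5)}
streamflow = {b: [b * 100 + d for d in range(10)] for b in range(5)}

chosen, samples = restrict_training_data(
    forcings, streamflow, n_basins=2, n_days=6, seq_len=3
)
print(len(chosen), "basins,", len(samples), "samples")
```

With 6 training days and a sequence length of 3, each basin yields 6 - 3 + 1 = 4 windows, so shrinking either the basin count or the training period reduces the sample count directly, while a longer sequence length trades samples for more context per sample.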

Citation

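@article{gauch2019proper,
    title={The proper care and feeding of {CAMELS}: How limited training data affects streamflow prediction},
    author={Martin Gauch and Juliane Mai and Jimmy Lin},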
    journal={Environmental Modelling \& Software},
    volume={135},
    pages={104926},
    year={2021},
    issn={1364-8152},
    doi={10.1016/j.envsoft.2020.104926},
    url={https://www.sciencedirect.com/science/article/pii/S136481522030983X}
}